🎯 What We'll Cover
Week 5 introduced retrieval-augmented generation (RAG) as the engine behind the AI literature tools — the technique of fetching relevant documents and feeding them to a model so its answer is grounded in real sources rather than its training data alone. Those tools and that technique are still central. But three things have shifted since Week 5, and this sub-lesson is about all three.
First, long context: frontier models now routinely accept a million tokens or more, which changes when you need retrieval at all. Second, agentic RAG: retrieval has moved from a single fetch-then-answer step to a loop in which the model plans, retrieves, reads, and retrieves again — and the Deep Research modes you have heard about are exactly this. Third, evaluation: we now have frameworks for measuring whether a RAG system is any good, though the hardest cases remain open.
The honest verdict, which this sub-lesson builds toward: for reading one long document, just use long context; for synthesising across many sources, agentic RAG still wins; and for anything you will cite, the Week 5 verification protocols apply regardless of which method produced the answer. None of this removes the researcher from the loop.
📖 Recap: What Week 5 Established
Week 5 mapped a landscape of AI literature tools — Elicit, Consensus, NotebookLM, Perplexity, Connected Papers, ResearchRabbit — and a hard warning: the hallucinated-citation crisis is real. Both still hold in 2026. The tools have improved; the warning has not expired.
In the Week 9 taxonomy, hallucinated citations are a reduced-but-persistent failure: far less common on well-covered topics than in 2023, but still present, and concentrated exactly where research lives — the niche, the recent, the long-tail. RAG was always partly a response to this: ground the model in retrieved sources and it has less need to invent them. That logic is intact. What has changed is how the grounding is done.
📜 Change One: Long Context vs Retrieval
In Week 5, a model's context window was small enough that you had to retrieve: you could not simply paste a stack of papers in. By 2026, frontier models accept a million tokens or more (the Gemini Pro tier from Week 8 being the standard example). At roughly three-quarters of an English word per token, that is on the order of a thousand pages or more at once. So a fair question is whether retrieval is still needed at all, or whether you should just dump everything into the context window and ask.
The most-cited study on this is Li et al. (2024), Retrieval Augmented Generation or Long-Context LLMs? (Google; EMNLP 2024). Their finding, in their words: “when resourced sufficiently, [long context] consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage.” They proposed a hybrid, “Self-Route”, that sends each query to RAG or to long context according to the model's own judgement of whether the retrieved passages are enough to answer it, cutting cost while keeping quality comparable to long context.
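To make the routing idea concrete, here is a minimal sketch of a Self-Route-style fallback. It is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the `UNANSWERABLE` sentinel, and the toy stand-ins for the retriever and the model are all hypothetical.

```python
from typing import Callable, List, Tuple

UNANSWERABLE = "unanswerable"  # hypothetical sentinel the model is asked to return


def self_route(
    query: str,
    retrieve: Callable[[str], List[str]],
    ask_model: Callable[[str], str],
    full_corpus: str,
) -> Tuple[str, str]:
    """Try the cheap RAG path first; fall back to long context if the model declines."""
    chunks = retrieve(query)
    rag_prompt = (
        "Answer using only the passages below. If they are not enough, "
        f"reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = ask_model(rag_prompt)
    if answer.strip().lower() != UNANSWERABLE:
        return "rag", answer
    # Expensive path: hand the model the whole corpus.
    return "long_context", ask_model(f"{full_corpus}\n\nQuestion: {query}")


# --- Toy stand-ins so the sketch runs without any API (assumptions) ---
DOCS = [
    "The capital of France is Paris.",
    "The study ran from 2019 to 2021.",
    "Funding came from a national grant.",
]


def weak_retrieve(query: str) -> List[str]:
    return DOCS[:1]  # deliberately poor: always returns the first passage


def fake_model(prompt: str) -> str:
    # "Knows" an answer only if the needed fact appears in the prompt.
    context, _, question = prompt.rpartition("Question: ")
    if "capital" in question.lower() and "Paris" in context:
        return "Paris."
    if "funding" in question.lower() and "national grant" in context:
        return "A national grant."
    return UNANSWERABLE
```

Calling `self_route("Where did the funding come from?", weak_retrieve, fake_model, "\n".join(DOCS))` takes the long-context path, because the single retrieved passage does not contain the answer. Only queries the retrieval step cannot serve pay the long-context price.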
⚠️ Two caveats on the long-context win
First, that study is from 2024, tested on the long-context models of the time. Treat it as evidence of the shape of the trade-off, not the current margins — the Week 9 “which model, which date” discipline applies.
Second, and more important: an advertised context length is not the same as reliable recall across the whole of it. A model that accepts a million tokens does not necessarily use all of them well; recall tends to sag in the middle of very long inputs, and practitioners in 2026 still report that a model's effective working memory across a giant context is well short of the advertised number. “It fits” is not “it was read carefully.”
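One way to check this for yourself is a "needle in a haystack" probe: hide a known fact at different depths of a long input and see whether the model can still find it. The sketch below only builds the prompts and tallies the results. `forgetful_model` is a toy stand-in (an assumption, not a real model) that reads only the first and last 30% of its input, so the code runs without an API; in practice you would pass in a function that calls the model you are testing.

```python
from typing import Callable, Dict, List


def build_probe(filler: List[str], needle: str, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler paragraphs."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return "\n\n".join(filler[:position] + [needle] + filler[position:])


def recall_by_depth(
    ask_model: Callable[[str], str],
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """For each depth, record whether the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        prompt = build_probe(filler, needle, depth) + f"\n\nQuestion: {question}"
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results


def forgetful_model(prompt: str) -> str:
    # Toy stand-in: only "reads" the first and last 30% of the prompt.
    cut = int(len(prompt) * 0.3)
    visible = prompt[:cut] + prompt[-cut:]
    return "The code is 7421." if "7421" in visible else "I could not find it."


filler = [f"Filler paragraph {i} about nothing in particular." for i in range(100)]
results = recall_by_depth(
    forgetful_model,
    filler,
    needle="The access code is 7421.",
    question="What is the access code?",
    expected="7421",
    depths=[0.0, 0.5, 1.0],
)
```

With the toy model, the needle is found at the start and the end but missed in the middle. A real model's behaviour will be less tidy, which is exactly why it is worth probing rather than assuming.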
The practical upshot: for analysing a single long document where everything relevant is in front of the model, long context is simpler and usually better. For drawing on a large or open-ended body of sources — where you cannot fit it all, and where the model needs to decide what to look at next — you still want retrieval. And that is where the second change comes in.
🔁 Change Two: Agentic RAG
Classic RAG is a single shot: retrieve a set of passages once, then generate an answer from them. Its weakness is obvious for real research questions — if the first retrieval misses, the answer is built on the wrong sources, and the system never reconsiders. Agentic RAG turns the single shot into a loop.
The canonical reference is Singh et al. (2025), Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG, which describes how agentic RAG “transcends these limitations by embedding autonomous AI agents into the RAG pipeline.” In plain terms, the agent does what a careful researcher does:
- Plan — break the question into sub-questions.
- Retrieve — search for sources on the first sub-question.
- Read and reflect — work out what was learned and what is still missing.
- Retrieve again — run follow-up searches based on what the first round turned up (this is the step classic RAG cannot do).
- Synthesise and check — assemble an answer and, in better systems, verify claims against the retrieved sources.
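The steps above can be sketched as a short loop. This is a minimal illustration, not any particular product's implementation: the planner, the reflection step, and the synthesiser are passed in as plain functions, and the ones used here (`toy_plan`, `toy_reflect`, `toy_synthesise`) plus the keyword retriever and the three-line corpus are hypothetical stand-ins for what a real system would do with a model and a search index.

```python
from typing import Callable, Dict, List


def overlap(query: str, text: str) -> int:
    """Number of distinct words the query and the text share."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def keyword_retrieve(query: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Toy retriever: return the ids of the k documents sharing most words with the query."""
    ranked = sorted(corpus, key=lambda d: overlap(query, corpus[d]), reverse=True)
    return [d for d in ranked[:k] if overlap(query, corpus[d]) > 0]


def agentic_rag(
    question: str,
    plan: Callable[[str], List[str]],
    reflect: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    synthesise: Callable[[str, List[str]], object],
    max_rounds: int = 4,
):
    pending = plan(question)                  # Plan: initial sub-questions
    evidence: List[str] = []                  # ids of documents read so far
    for _ in range(max_rounds):               # hard cap so the loop always ends
        if not pending:
            break
        sub_question = pending.pop(0)
        for doc_id in retrieve(sub_question):  # Retrieve
            if doc_id not in evidence:
                evidence.append(doc_id)
        # Read and reflect: what is still missing? Queue follow-up searches.
        pending.extend(reflect(question, evidence))
    return synthesise(question, evidence)      # Synthesise (checking omitted here)


CORPUS = {
    "a": "malaria incidence fell after the bed net rollout",
    "b": "the bed net rollout was funded by a regional programme",
    "c": "rainfall was unusually low that year",
}


def toy_plan(question: str) -> List[str]:
    return ["malaria incidence"]


def toy_reflect(question: str, evidence: List[str]) -> List[str]:
    # After reading "a" we learn a rollout happened, so we ask who funded it.
    return ["who funded the bed net rollout"] if evidence == ["a"] else []


def toy_synthesise(question: str, evidence: List[str]) -> List[str]:
    return evidence  # stand-in: just report which sources were used


question = "why did malaria incidence fall and who paid"
single_shot = keyword_retrieve(question, CORPUS)
agentic = agentic_rag(
    question, toy_plan, toy_reflect,
    lambda q: keyword_retrieve(q, CORPUS), toy_synthesise,
)
```

The single-shot retrieval finds only document `a`. The loop also reaches `b`, because the follow-up query could only be written after `a` had been read. The `max_rounds` cap is a design choice worth copying: without it, a reflection step that keeps proposing new searches would never stop.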
This loop is just the agent definition from 10.1 (model + tools + loop + memory) pointed at a retrieval tool. And it is not a research curiosity: the “Deep Research” modes in the major assistants — which take a question, spend several minutes searching and reading, and return a cited report — are agentic RAG in consumer form. When you use one, you are running a planner-retriever-synthesiser loop, with all the strengths and the 10.2 reliability caveats that implies. We compare the specific Deep Research tools, free tiers included, in 10.5.
📏 Change Three: Can You Tell If It's Any Good?
If a RAG system can retrieve the wrong sources and still produce a fluent answer, how do you measure quality? The most widely adopted answer is RAGAS (Es, James, Espinosa-Anke & Schockaert, 2023), a framework that scores a RAG system largely without hand-written gold answers for every question. Its widely used metrics separate the two places RAG can fail:
- Context precision & context recall — did the retrieval step fetch the relevant passages, and only those?
- Faithfulness — is the generated answer actually supported by the retrieved passages, or did the model add unsupported claims?
- Answer relevancy — does the answer actually address the question asked?
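To see what the retrieval and faithfulness scores are measuring, here is a deliberately simplified sketch. It is not the RAGAS library or its API: RAGAS uses a judge model to decide relevance and support, whereas this toy version assumes hand-labelled relevant passages and treats a claim as "supported" only if it appears verbatim in a passage. All function names are illustrative.

```python
from typing import List, Set


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of retrieved passages that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of relevant passages that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def faithfulness(claims: List[str], passages: List[str]) -> float:
    """Share of answer claims found in some retrieved passage (crude substring check)."""
    if not claims:
        return 1.0  # an empty answer asserts nothing unsupported
    supported = sum(
        any(claim.lower() in passage.lower() for passage in passages)
        for claim in claims
    )
    return supported / len(claims)


retrieved = ["p1", "p2", "p4", "p5"]
relevant = {"p1", "p2", "p3"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved

passages = ["In total the trial enrolled 120 patients across three sites."]
claims = ["the trial enrolled 120 patients", "the effect lasted two years"]
faith = faithfulness(claims, passages)              # 1 of 2 claims is supported
```

Here precision is 0.5, recall is two-thirds, and faithfulness is 0.5: the second claim has no support in the retrieved passage. Note that every one of these scores compares the system's output against the passages or the labels, never against the world.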
⚠️ What these metrics do and do not tell you
The crucial limitation: faithfulness is not correctness. An answer can be perfectly faithful to its retrieved passages — inventing nothing beyond them — and still be wrong, because the passages themselves were wrong, outdated, or cherry-picked. RAGAS checks that the model did not hallucinate beyond its sources; it cannot check that the sources were right. That second check is yours.
Two further caveats: RAGAS uses a strong “judge” model to score, so it inherits that model's blind spots; and the hardest cases — multi-hop questions that require chaining several retrievals, and judging how well an agent compacted a huge context — remain operationally hard to evaluate even in 2026.
⚖️ The Honest Verdict for May 2026
Pulling the three changes together, here is how to choose, as a researcher, for a given task:
| Your task | Best approach (May 2026) | Why |
|---|---|---|
| Analyse one long document (a thesis, a report, a contract) | Long context — paste it in | Everything relevant fits; retrieval adds complexity for no gain. |
| Synthesise across many sources or the open literature | Agentic RAG / Deep Research | The model needs to decide what to read next; follow-up retrieval is the whole point. |
| Anything you will cite in your own work | Either — then verify by hand | The Week 5 citation checks apply regardless of method. Faithful ≠ correct. |
| High-volume / repeated querying on a budget | RAG | RAG's lower cost is its standing advantage (Li et al.); stuffing a million tokens in per query is expensive. |
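The cost row is easy to check with back-of-envelope arithmetic. The figures below are assumptions chosen for illustration, not quoted prices: a hypothetical rate of 1 currency unit per million input tokens, a one-million-token corpus, and a RAG setup that passes five 500-token passages per query. Output tokens, embedding and indexing costs, and any prompt-caching discounts are ignored.

```python
def input_cost(n_queries: int, tokens_per_query: int, price_per_million: float) -> float:
    """Total input-token cost for a batch of queries."""
    return n_queries * tokens_per_query * price_per_million / 1_000_000


PRICE_PER_MILLION = 1.0        # hypothetical rate, in any currency
N_QUERIES = 1_000
CORPUS_TOKENS = 1_000_000      # long context: the whole corpus goes in every time
RAG_TOKENS = 5 * 500           # RAG: five retrieved passages of about 500 tokens

long_context_cost = input_cost(N_QUERIES, CORPUS_TOKENS, PRICE_PER_MILLION)
rag_cost = input_cost(N_QUERIES, RAG_TOKENS, PRICE_PER_MILLION)
ratio = long_context_cost / rag_cost
```

Under these assumptions a thousand queries cost 1,000 units with long context and 2.5 units with RAG, a 400-fold difference. The exact ratio will move with real prices and caching, but the shape of the trade-off is what the table row rests on.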
🧠 A Week 9.4 warning that sharpens here
Messeri & Crockett warned (Week 9.4) that AI can create “monocultures of knowing” — everyone funnelled toward the same sources and framings. Agentic RAG intensifies the risk, because now the agent decides what to retrieve, what counts as relevant, and what to leave out — and it does so invisibly, in a loop you do not see. A Deep Research report that reads as comprehensive may rest on a narrow, agent-chosen slice of the literature. The friction that retrieval used to add — you, deciding what to read — was doing useful epistemic work. Automating it away is convenient and quietly costly.
🌍 RAG, Local Corpora, and the Equity Angle
There is a genuinely hopeful version of RAG for African and other under-represented research contexts. Frontier models are trained predominantly on high-resource languages and well-digitised corpora; they know comparatively little about, say, isiXhosa scholarship or region-specific datasets. RAG offers a route around this: ground the model in a local corpus it was never trained well on, and it can work with material outside its training distribution.
The catch is in the retrieval layer. RAG works by converting text into embeddings (numerical representations) and matching on them — and the embedding models are themselves trained mostly on high-resource languages. If the embedding layer cannot represent isiXhosa or Setswana well, retrieval over an isiXhosa corpus will be poor no matter how good the generating model is. So the equity promise is real but conditional: RAG can ground frontier models in local knowledge only as well as the retrieval layer handles the local language, and as of 2026 most embedding models handle African languages poorly. This connects directly to the sovereign-capacity questions of Week 11 — the Esethu Framework (Rajab et al., 2025) argues for exactly this kind of locally-grounded infrastructure as a precondition for equitable AI, not an afterthought.
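A toy example shows why the retrieval layer is the bottleneck. The embedder below is a bag-of-words counter over a fixed vocabulary, and anything outside that vocabulary is silently dropped. Real embedding models are far more sophisticated (they split unfamiliar words into subword pieces instead of dropping them), so treat this as an exaggerated sketch of the failure mode. The vocabulary, the documents, and the `lr_term_*` tokens, which are placeholders for words in a language the embedder never saw, are all assumptions.

```python
import math
from typing import Dict, List


def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: count how often each vocabulary word appears; other words are ignored."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_all(query: str, docs: Dict[str, str], vocab: List[str]) -> Dict[str, float]:
    """Similarity of the query to every document in the corpus."""
    q = embed(query, vocab)
    return {doc_id: cosine(q, embed(text, vocab)) for doc_id, text in docs.items()}


VOCAB = ["water", "quality", "school", "attendance"]  # the only words the embedder knows
DOCS = {
    "en_water": "water quality report",
    "en_school": "school attendance survey",
    "lr_water": "lr_term_1 lr_term_2 lr_term_3",   # placeholder low-resource text
    "lr_school": "lr_term_4 lr_term_5 lr_term_6",
}

covered = score_all("water quality", DOCS, VOCAB)
uncovered = score_all("lr_term_1 lr_term_2", DOCS, VOCAB)
```

The covered query scores close to 1.0 against `en_water` and 0.0 against everything else. The uncovered query scores 0.0 against every document, including the one that shares its words, so the retriever has nothing to rank on. No generating model, however strong, can recover from a retrieval step that cannot tell the documents apart.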
📖 Sources & Further Reading
- Singh, A., Ehtesham, A., Kumar, S., Talaei Khoei, T., & Vasilakos, A. V. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136 — the canonical agentic-RAG reference.
- Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. arXiv:2407.16833 (Google; EMNLP 2024) — the long-context-vs-RAG trade-off and the Self-Route hybrid. (2024 evidence — read for shape, not current margins.)
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217 — the RAG evaluation framework; metric definitions at docs.ragas.io.
- Esethu Framework (Rajab et al., 2025). arXiv:2502.15916 — locally-grounded infrastructure for equitable AI (introduced in Week 4; expanded in Week 11).
👉 What Comes Next
Sub-Lesson 10.5 — Advanced Research Tools: A Curated Tour. We have now built up the concepts (10.1–10.2) and the categories (10.3–10.4). The next sub-lesson is the practical pay-off: a tool-by-tool tour with an honest free-versus-paid split, including the Deep Research modes that implement the agentic RAG from this lesson, the Chinese free-tier options, and the MCP connectors that plug agents into your research tools — all framed by the question that runs through this week: what can you actually use, from here, for free?